Statistical inference, part III

Point and interval estimates

Eva Freyhult

NBIS, SciLifeLab

April 8, 2025

Point estimate

  • Unknown population parameters can be inferred from estimates from a random sample.

  • The sample estimate will be our best guess, a point estimate, of the population parameter.

  • The sample proportion and sample mean are unbiased estimates of the population proportion and population mean.

  • The expected value of an unbiased point estimate is the population parameter that it estimates.

  • The sample estimate is our best guess, but it will not be without error.
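
The points above can be illustrated with a small simulation. This is only a sketch: the population (normal with mean 10 and standard deviation 2), the sample size and the number of repetitions are assumptions chosen for illustration, not values from the course material.

```r
set.seed(1)

# Hypothetical population parameters (assumed for illustration)
mu <- 10
sigma <- 2
n <- 20

# Draw 10000 random samples of size n and compute the mean of each
means <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

# Unbiased: the average of the sample means is close to mu
mean(means)

# Not without error: individual sample means vary around mu
range(means)
```

Each single sample mean is a point estimate with some error, but on average the estimates land on the population mean.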

Bias and precision

Figure 1: Bias and precision.

Interval estimates

  • To show the uncertainty of a point estimate, an interval estimate for the population parameter can be computed from the sample data.

  • An interval estimate is an interval of possible values that with high probability contains the true population parameter.

  • The width of the interval estimate can be determined from the sampling distribution.

Bootstrap interval

  • If the sampling distribution is unknown, a bootstrap interval can be computed instead.

  • Bootstrapping uses the data we have (our sample) and samples repeatedly, with replacement, from this sample.

  • Put the entire sample in an urn and resample!

Bootstrap interval

Sample with replacement many times.

Bootstrap interval

Plot the distribution of bootstrapped means.

Bootstrap interval

For a 95% bootstrap interval, compute the 2.5 and 97.5 percentiles.
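
The bootstrap steps above can be sketched in R as follows. The sample `x` is simulated here purely for illustration (an assumption, not course data), and the number of resamples `B` is an arbitrary choice.

```r
set.seed(1)

# Hypothetical sample of 30 observations (simulated for illustration only)
x <- rnorm(30, mean = 27, sd = 3)

# Resample with replacement B times and compute the mean of each resample
B <- 5000
boot_means <- replicate(B, mean(sample(x, size = length(x), replace = TRUE)))

# 95% bootstrap interval: the 2.5 and 97.5 percentiles of the bootstrapped means
ci <- quantile(boot_means, probs = c(0.025, 0.975))
ci
```

Calling `hist(boot_means)` afterwards would plot the distribution of the bootstrapped means.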

Confidence interval

A confidence interval is a type of interval estimate associated with a confidence level.

An interval that covers the population parameter \(\theta\) with probability \(1 - \alpha\) is called a confidence interval for \(\theta\) with confidence level \(1 - \alpha\).

Sampling distribution of mean

\[\bar X \sim N(\mu, \frac{\sigma}{\sqrt{n}})\]
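
A quick simulation can make the sampling distribution concrete. In this sketch the population values (\(\mu = 0\), \(\sigma = 4\)) and the sample size (\(n = 16\)) are assumptions chosen so that the theoretical standard error is exactly 1.

```r
set.seed(1)

# Hypothetical population and sample size (assumed for illustration)
mu <- 0
sigma <- 4
n <- 16

# Means of 10000 independent samples of size n
means <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

# Empirical standard deviation of the sample means ...
sd(means)

# ... compared with the theoretical standard error sigma / sqrt(n)
sigma / sqrt(n)
```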

Confidence interval

Based on a random sample, compute the sample mean \(m\).

Use what is known about the sampling distribution to compute a confidence interval around \(m\).

Confidence interval of mean

If \(\sigma\) is known

\[Z = \frac{\bar X - \mu}{SEM} = \frac{\bar X - \mu}{\frac{\sigma}{\sqrt{n}}} \sim N(0, 1)\]

Standard normal distribution

\[P\left(-z_{\alpha/2} < Z <z_{\alpha/2}\right) = 1-\alpha\]

\(z_{\alpha/2}\) is the value such that \(P(Z \geq z_{\alpha/2}) = \frac{\alpha}{2} \iff P(Z \leq z_{\alpha/2}) = 1 - \frac{\alpha}{2}\).

For 95% confidence, \(\alpha = 0.05\) and \(z_{\alpha/2} = z_{0.025} = 1.96\). For 90% and 99% confidence, \(z_{0.05} = 1.64\) and \(z_{0.005}=2.58\), respectively.
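
The \(z\) values can be looked up in R with the quantile function `qnorm`. A minimal sketch for the three confidence levels mentioned:

```r
# Significance levels for 90%, 95% and 99% confidence
alpha <- c(0.10, 0.05, 0.01)

# z_{alpha/2} is the 1 - alpha/2 quantile of the standard normal distribution
z <- qnorm(1 - alpha / 2)
round(z, 2)
# [1] 1.64 1.96 2.58
```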

Confidence interval of mean

If \(\sigma\) is known

\[Z = \frac{\bar X - \mu}{SEM} = \frac{\bar X - \mu}{\frac{\sigma}{\sqrt{n}}} \sim N(0, 1)\]

From the standard normal distribution we know:

\[P(-z_{\alpha/2}<Z<z_{\alpha/2}) = 1-\alpha\]

\[P(-z_{\alpha/2}<\frac{\bar X-\mu}{SEM}<z_{\alpha/2}) = 1-\alpha\]

\[P(\mu-z_{\alpha/2}SEM<\bar X<\mu+z_{\alpha/2}SEM) = 1-\alpha\]

\[P(\bar X-z_{\alpha/2}SEM<\mu<\bar X+z_{\alpha/2}SEM) = 1-\alpha\]

Replace \(\bar X\) with the observed sample mean, \(\bar x_{obs}\).

\[P(\bar x_{obs}-z_{\alpha/2}SEM<\mu<\bar x_{obs}+z_{\alpha/2}SEM) = 1-\alpha\]

Confidence interval of mean

If \(\sigma\) is known

\[Z = \frac{\bar X - \mu}{SEM} = \frac{\bar X - \mu}{\frac{\sigma}{\sqrt{n}}} \sim N(0, 1)\]

The confidence interval with confidence level \(1-\alpha\) is

\[[\bar x_{obs} - z_{\alpha/2}SEM, \bar x_{obs} + z_{\alpha/2}SEM]\]

or

\[\mu = \bar x_{obs} \pm z_{\alpha/2}SEM\]

where \(SEM = \frac{\sigma}{\sqrt{n}}\).
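
As a sketch of how the formula is applied when \(\sigma\) is known, the interval can be computed in R. All numbers below (observed mean 50, \(\sigma = 10\), \(n = 25\)) are hypothetical values chosen for illustration.

```r
# Hypothetical values, chosen only for illustration
x_obs <- 50    # observed sample mean
sigma <- 10    # population standard deviation, assumed known
n <- 25        # sample size
alpha <- 0.05  # 95% confidence level

# Standard error of the mean and the z value
SEM <- sigma / sqrt(n)
z <- qnorm(1 - alpha / 2)

# Confidence interval: x_obs +/- z * SEM
ci <- x_obs + c(-1, 1) * z * SEM
ci
```

With these values, \(SEM = 2\) and the interval is approximately \(50 \pm 3.92\).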

Confidence interval of mean

The mean of a sample of \(n\) independent and identically normally distributed observations \(X_i\) is normally distributed:

\[\bar X \sim N(\mu, \frac{\sigma}{\sqrt{n}})\]

If \(\sigma\) is unknown and \(n\) is small?

Use the statistic \(t=\frac{\bar X - \mu}{SEM} = \frac{\bar X - \mu}{\frac{s}{\sqrt{n}}} \sim t(n-1)\), t-distributed with \(n-1\) degrees of freedom.

It follows that

\[ \begin{aligned} & P\left(-t < \frac{\bar X - \mu}{\frac{s}{\sqrt{n}}} < t\right) = 1 - \alpha \iff \\ & P\left(\bar X - t \frac{s}{\sqrt{n}} < \mu < \bar X + t \frac{s}{\sqrt{n}}\right) = 1 - \alpha \end{aligned} \]

The confidence interval is

\[[\bar x_{obs} - t \frac{s}{\sqrt{n}}, \bar x_{obs} + t \frac{s}{\sqrt{n}}]\]

or

\[\mu = \bar x_{obs} \pm t \frac{s}{\sqrt{n}}\]

Confidence interval of mean

The confidence interval with confidence level \(1-\alpha\) is thus

\[\mu = \bar x_{obs} \pm t \frac{s}{\sqrt{n}}\]

For a 95% confidence interval and \(n=5\), \(t=\) 2.7764.

The \(t\) values for different values of \(\alpha\) and degrees of freedom are tabulated and can be computed in R using the function qt.

n=5
alpha = 0.05
## t value
qt(1-alpha/2, df=n-1)
[1] 2.776

Example

You study the BMI of male diabetic patients. In a sample of size 6 you observe \(27, 25, 31, 29, 30, 22\). Assume that BMI is normally distributed and calculate a 95% confidence interval for the mean BMI in male diabetic patients.

The sample mean is \(\bar x = 27.3\) and the sample standard deviation is \(s = 3.39\). The degrees of freedom is \(n-1 = 5\) and the \(t\) value for a 95% confidence interval is 2.5706.

The confidence interval is \(\bar x \pm t \frac{s}{\sqrt{n}} = 27.3 \pm 2.57 \frac{3.39}{\sqrt{6}} = 27.3 \pm 3.6\).

x <- c(27, 25, 31, 29, 30, 22)
(m <- mean(x))
[1] 27.33
(s <- sd(x))
[1] 3.386
(q <- qt(0.975, df=5))
[1] 2.571
(SEM <- s/sqrt(6))
[1] 1.382
(CI <- m + c(-1,1)*q*SEM)
[1] 23.78 30.89
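
As a cross-check of the manual calculation, the same interval can be obtained from the built-in `t.test` function, which returns a t-based confidence interval for the mean in its `conf.int` component. A minimal sketch:

```r
# BMI observations from the example
x <- c(27, 25, 31, 29, 30, 22)

# One-sample t-test; only the confidence interval is used here
res <- t.test(x, conf.level = 0.95)
res$conf.int
```

The limits should agree with the interval computed step by step above.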

Confidence interval of proportions

Remember that we can use the central limit theorem to show that

\[P \sim N\left(\pi, SE\right) \iff P \sim N\left(\pi, \sqrt{\frac{\pi(1-\pi)}{n}}\right)\]

It follows that

\[Z = \frac{P - \pi}{SE} \sim N(0,1)\]

Based on what we know of the standard normal distribution, we can compute an interval around the population parameter \(\pi\) such that the probability that a sample proportion \(p\) falls within this interval is \(1-\alpha\).

Confidence interval of proportion

\[P\left(-z_{\alpha/2} < Z <z_{\alpha/2}\right) = 1-\alpha\]

\[P\left(-z_{\alpha/2} < \frac{P - \pi}{SE} < z_{\alpha/2}\right) = 1 - \alpha\]

We can rewrite this to

\[P\left(\pi-z_{\alpha/2} SE < P < \pi + z_{\alpha/2} SE\right) = 1-\alpha\]

In words, a sample proportion \(p\) will fall within \(\pi \pm z_{\alpha/2} SE\) with probability \(1- \alpha\).

The equation can also be rewritten to

\[P\left(P-z_{\alpha/2} SE < \pi < P + z_{\alpha/2} SE\right) = 1 - \alpha\]

Confidence interval of proportion

The observed confidence interval is what we get when we replace the random variable \(P\) with our observed fraction,

\[p-z_{\alpha/2} SE < \pi < p + z_{\alpha/2} SE\]

\[\pi = p \pm z_{\alpha/2} SE = p \pm z_{\alpha/2} \sqrt{\frac{p(1-p)}{n}}\]

Confidence interval of proportion

The 95% confidence interval is

\[\pi = p \pm 1.96 \sqrt{\frac{p(1-p)}{n}}\]

Confidence interval of proportion

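A 95% confidence interval has a 95% chance of covering the true value: if the sampling is repeated many times, about 95% of the resulting intervals will contain the true proportion \(\pi\).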
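
The coverage property can be checked by simulation. The sketch below assumes a hypothetical true proportion of 0.4 and a sample size of 100; both are illustrative choices. Because the interval relies on a normal approximation, the observed coverage is typically close to, but not exactly, 95%.

```r
set.seed(1)

# Hypothetical true proportion and sample size (assumed for illustration)
pi_true <- 0.4
n <- 100
z <- qnorm(0.975)

# Repeat the experiment 10000 times and record whether each interval covers pi_true
covers <- replicate(10000, {
  p <- rbinom(1, size = n, prob = pi_true) / n   # observed sample proportion
  SE <- sqrt(p * (1 - p) / n)                    # estimated standard error
  (p - z * SE < pi_true) && (pi_true < p + z * SE)
})

# Fraction of intervals that cover the true proportion
mean(covers)
```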

Confidence interval of proportion

Back to our example of the proportion of pollen allergic in Uppsala, where \(p=0.42\) and \(SE=\sqrt{\frac{p(1-p)}{n}} = 0.0494\).

Hence, the 95% confidence interval is

\[\pi = 0.42 \pm 1.96 \cdot 0.0494 = 0.42 \pm 0.097\]

or

\[(0.42-0.097, 0.42+0.097) = (0.32, 0.52)\]
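
The calculation can be reproduced in R. Note that the sample size is not stated on this slide; \(n = 100\) is an assumption made here because it gives the standard error 0.0494 quoted above.

```r
p <- 0.42   # observed proportion from the example
n <- 100    # assumption: not stated here, chosen because it reproduces SE = 0.0494

# Estimated standard error and the z value for 95% confidence
SE <- sqrt(p * (1 - p) / n)
z <- qnorm(0.975)

# Confidence interval: p +/- z * SE
ci <- p + c(-1, 1) * z * SE
round(SE, 4)
round(ci, 2)
# [1] 0.32 0.52
```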